Disclaimer

Do not repeat my commentary in your assignments. Any written work outside of the code chunks is just that: my commentary. You wouldn’t copy and paste from another article without proper citation, so do not plagiarize the lab material either. I will take points off for plagiarism.


Load Data

library(tidyverse)
library(stargazer)
library(knitr)

execs <- read.csv("data/lab04_execs_data.csv")

Review of Stargazer

  • Stargazer Cheat Sheet
  • Be sure to use the option results='asis' in your code chunk, i.e. {r, results='asis'}
  • Also specify the option header=FALSE in your Stargazer code to omit the citation.

Pay Attention to Variable Types

# To knit PDFs, change this to type="latex"
stargazer(execs, header=FALSE, type='html',
          font.size = "footnotesize", 
          digits = 2,
          title = "Descriptive Statistics of Study Variables" )
Descriptive Statistics of Study Variables

Statistic            N      Mean      St. Dev.   Min    Max
id                   1,000  3,052.00    803.00   1,702  4,476
year                 1,000  2,009.00      0.00   2,009  2,009
salary               1,000    749.00    440.00   0.00   5,600.00
bonus                1,000    243.00  1,068.00   0.00   19,891.00
total_compensation   1,000    992.00  1,190.00   0.00   21,033.00
age                  1,000     56.00      7.30   37     89

Note that several of the variables are missing from the table because they are not numeric.

typeof(execs$gender)
## [1] "integer"
typeof(execs$ethnicity)
## [1] "integer"
typeof(execs$ceo)
## [1] "integer"

Run the regressions properly for your independent variables. If it is a categorical or ordered variable such as gender, race, or education, model it as a series of dummy variables with a reference group. Do not model these types of covariates as continuous.

The exception is ordered variables with a large number of categories. If education has 20 possibilities, such as 1 year of education, 2 years of education, \(\ldots\), then it is generally okay to treat that variable as continuous. If education is “Less than HS,” “HS/GED,” “Some College,” and “College Degree,” you need to model it as a categorical variable with a reference group.
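As a sketch of what this looks like in practice (toy vectors, not the lab dataset), encoding a categorical education variable with an explicit reference group works like this:

```r
# Toy sketch (not the lab data): encode education as a factor with
# "Less than HS" as the reference group, so lm() creates dummy
# variables for the remaining levels
edu <- c("HS/GED", "College Degree", "Less than HS", "Some College")
edu <- factor(edu, levels = c("Less than HS", "HS/GED",
                              "Some College", "College Degree"))
levels(edu)[1]  # the first level is the reference group in lm()
```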

Suppose the Data Were Not Already Encoded as Factors

execs_numeric <- execs %>% 
  mutate(gender = as.numeric(gender),
         ethnicity = as.numeric(ethnicity),
         ceo = as.numeric(ceo))

Wrong Way to Run Regression

#Salary by Gender
lm_1 <- lm(salary ~ gender, data=execs_numeric)

#Salary by CEO Status
lm_2 <- lm(salary ~ ceo, data=execs_numeric)

#Salary by Race/Ethnicity
lm_3 <- lm(salary ~ ethnicity, data=execs_numeric)


#Data Table
stargazer(lm_1, lm_2, lm_3, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Incorrect Linear Models of Executive Compensation",
          star.cutoffs = c(0.05, 0.01, 0.001))
Incorrect Linear Models of Executive Compensation
Dependent variable:
salary
(1) (2) (3)
gender 135.00*
(58.00)
ceo -254.00***
(28.00)
ethnicity -95.00***
(13.00)
Constant 486.00*** 1,094.00*** 1,096.00***
(114.00) (40.00) (50.00)
Observations 1,000 1,000 1,000
R2 0.01 0.08 0.05
Adjusted R2 0.004 0.08 0.05
Residual Std. Error (df = 998) 439.00 423.00 429.00
F Statistic (df = 1; 998) 5.40* 83.00*** 51.00***
Note: p<0.05; p<0.01; p<0.001

The only difference that makes this approach incorrect is the encoding: gender, ceo, and ethnicity are treated as numeric. We know this because the data type for each is currently numeric, when it should be factor.

Contrast these models with the corrected version with factors:

#Salary by Gender
lm_1 <- lm(salary ~ gender, data=execs)

#Salary by CEO Status
lm_2 <- lm(salary ~ ceo, data=execs)

#Salary by Race/Ethnicity
lm_3 <- lm(salary ~ ethnicity, data=execs)


#Data Table
stargazer(lm_1, lm_2, lm_3, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Correct Linear Models of Executive Compensation",
          star.cutoffs = c(0.05, 0.01, 0.001))
Correct Linear Models of Executive Compensation
Dependent variable:
salary
(1) (2) (3)
genderMALE 135.00*
(58.00)
ceoOther Executive -254.00***
(28.00)
ethnicityASIAN -133.00
(151.00)
ethnicityCAUCASIAN -73.00
(116.00)
ethnicityHISPANIC -371.00
(444.00)
ethnicityUNKNOWN -282.00*
(117.00)
Constant 621.00*** 839.00*** 896.00***
(57.00) (17.00) (115.00)
Observations 1,000 1,000 1,000
R2 0.01 0.08 0.05
Adjusted R2 0.004 0.08 0.05
Residual Std. Error 439.00 (df = 998) 423.00 (df = 998) 429.00 (df = 995)
F Statistic 5.40* (df = 1; 998) 83.00*** (df = 1; 998) 14.00*** (df = 4; 995)
Note: p<0.05; p<0.01; p<0.001

Basic Multiple Regression

Now we will run a basic multiple regression model using OLS. The model will use several independent variables and covariates.

Multiple Regression

# Salary by Age, Gender, Ethnicity, and CEO 
lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs)

#Data Table
stargazer(lm_1, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Multiple Regression of Executive Salary",
          star.cutoffs = c(0.05, 0.01, 0.001))
Multiple Regression of Executive Salary
Dependent variable:
salary
age 4.00*
(1.80)
genderMALE 84.00
(56.00)
ethnicityASIAN -150.00
(146.00)
ethnicityCAUCASIAN -114.00
(112.00)
ethnicityHISPANIC -276.00
(429.00)
ethnicityUNKNOWN -289.00*
(113.00)
ceoOther Executive -227.00***
(28.00)
Constant 701.00***
(153.00)
Observations 1,000
R2 0.12
Adjusted R2 0.11
Residual Std. Error 414.00 (df = 992)
F Statistic 19.00*** (df = 7; 992)
Note: p<0.05; p<0.01; p<0.001

Note that the \(R^2\) is higher than in the single linear regression, indicating that we explain more variance.

Changing Reference Groups

Currently, the reference groups are arguably not the ideal ones. We might want males to be the reference group so we can see the effect for females; whites to be the reference race so we can see the effect for African Americans; and other executives to be the reference group so we can see the effect for CEOs. One way to do this is to recode the data:

kable(table(execs$ethnicity, execs$ceo))
                   CEO   Other Executive
AFRICAN-AMERICAN     8                 6
ASIAN               12                 7
CAUCASIAN          430               183
HISPANIC             0                 1
UNKNOWN            195               158
execs <- execs %>% 
  mutate(gender = factor(gender, levels = c("MALE", "FEMALE")),
         ethnicity = factor(ethnicity, 
                            levels = c("CAUCASIAN", "AFRICAN-AMERICAN", "HISPANIC", "ASIAN", "UNKNOWN")),
         ceo = factor(ceo,  levels = c("Other Executive", "CEO")))

knitr::kable(table(execs$ethnicity, execs$ceo))
                   Other Executive   CEO
CAUCASIAN                      183   430
AFRICAN-AMERICAN                 6     8
HISPANIC                         1     0
ASIAN                            7    12
UNKNOWN                        158   195

More information about changing factor order is available here.
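An alternative to re-specifying every level, sketched here on a toy factor rather than the lab columns, is relevel(), which only moves the chosen reference level to the front:

```r
# Toy sketch: relevel() changes just the reference level of a factor
ceo <- factor(c("CEO", "Other Executive", "CEO"))
ceo <- relevel(ceo, ref = "Other Executive")
levels(ceo)  # "Other Executive" now comes first
```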

Results After Changing Factor Order

# Salary by Age, Gender, Ethnicity, and CEO 
lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs)

#Data Table
stargazer(lm_1, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Multiple Regression of Executive Salary",
          star.cutoffs = c(0.05, 0.01, 0.001))
Multiple Regression of Executive Salary
Dependent variable:
salary
age 4.00*
(1.80)
genderFEMALE -84.00
(56.00)
ethnicityAFRICAN-AMERICAN 114.00
(112.00)
ethnicityHISPANIC -161.00
(415.00)
ethnicityASIAN -36.00
(97.00)
ethnicityUNKNOWN -174.00***
(28.00)
ceoCEO 227.00***
(28.00)
Constant 443.00***
(107.00)
Observations 1,000
R2 0.12
Adjusted R2 0.11
Residual Std. Error 414.00 (df = 992)
F Statistic 19.00*** (df = 7; 992)
Note: p<0.05; p<0.01; p<0.001

Adding a Variable Transformation

We might expect that age does not have a strictly linear relationship with salary.

ggplot(execs) +
  geom_point(aes(age, salary, color=gender), alpha = 0.25) +
  geom_smooth(mapping = aes(age, salary), method="loess") +
  xlab("Age") + 
  ylab("Salary in $1,000's") +
  ggtitle("Executive Salary by Age and Gender") +
  theme(plot.title = element_text(hjust = 0.5)) 

Add Quadratic and Cubic Terms

execs <- execs %>% 
  mutate(age_sq = age^2,
         age_cb = age^3)
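The same polynomial terms can instead be created inside the model formula with I(), which avoids storing extra columns; this is a sketch on simulated data, not the execs file:

```r
# Toy sketch: polynomial terms via I() in the formula itself
set.seed(1)
d <- data.frame(age = rnorm(100, mean = 56, sd = 7))
d$salary <- 500 + 4 * d$age + rnorm(100, sd = 50)
fit <- lm(salary ~ age + I(age^2) + I(age^3), data = d)
names(coef(fit))  # intercept plus the three age terms
```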

Do We Have Enough Data For Each Categorical Variable?

library(pander)
pander(summary(execs$gender))
MALE FEMALE
940 60
pander(summary(execs$ethnicity))
CAUCASIAN AFRICAN-AMERICAN HISPANIC ASIAN UNKNOWN
613 14 1 19 353
pander(summary(execs$ceo))
Other Executive CEO
355 645

We have only one Hispanic executive in the dataset; we should drop this observation to avoid problems later in the analysis.

execs <- filter(execs, ethnicity!="HISPANIC")
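One caveat, sketched on a toy factor: filtering removes the rows but leaves the empty "HISPANIC" level attached to the factor, which droplevels() cleans up:

```r
# Toy sketch: empty factor levels persist after subsetting
eth <- factor(c("CAUCASIAN", "HISPANIC", "ASIAN"))
eth <- eth[eth != "HISPANIC"]
levels(eth)            # "HISPANIC" is still listed as a level
eth <- droplevels(eth)
levels(eth)            # now only the observed levels remain
```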

New Models

# Salary by Age, Gender, Ethnicity, and CEO 
lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs)

# Adding Age Squared
lm_2 <- lm(salary ~ age + age_sq + gender + ethnicity + ceo, data=execs)

# Adding Age Cubed
lm_3 <- lm(salary ~ age + age_sq + age_cb + gender + ethnicity + ceo, data=execs)

#Data Table
stargazer(lm_1, lm_2, lm_3, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Multiple Regression of Executive Salary",
          star.cutoffs = c(0.05, 0.01, 0.001))
Multiple Regression of Executive Salary
Dependent variable:
salary
(1) (2) (3)
age 4.00* 40.00* 258.00*
(1.80) (17.00) (107.00)
age_sq -0.31* -4.00*
(0.14) (1.80)
age_cb 0.02*
(0.01)
genderFEMALE -84.00 -92.00 -102.00
(56.00) (56.00) (56.00)
ethnicityAFRICAN-AMERICAN 114.00 108.00 96.00
(112.00) (112.00) (112.00)
ethnicityASIAN -36.00 -34.00 -39.00
(97.00) (97.00) (97.00)
ethnicityUNKNOWN -174.00*** -167.00*** -166.00***
(28.00) (28.00) (28.00)
ceoCEO 227.00*** 220.00*** 219.00***
(28.00) (28.00) (28.00)
Constant 443.00*** -602.00 -4,802.00*
(107.00) (483.00) (2,092.00)
Observations 999 999 999
R2 0.12 0.12 0.13
Adjusted R2 0.11 0.12 0.12
Residual Std. Error 414.00 (df = 992) 413.00 (df = 991) 413.00 (df = 990)
F Statistic 23.00*** (df = 6; 992) 20.00*** (df = 7; 991) 18.00*** (df = 8; 990)
Note: p<0.05; p<0.01; p<0.001

Examining Influence and Variables: Histograms

#Melt from Reshape2
library(reshape2)

hist_data <- execs %>% 
  select(-c(id, year)) %>% 
  mutate(gender = as.numeric(gender),
         ethnicity = as.numeric(ethnicity),
         ceo = as.numeric(ceo)) %>% 
  melt()

ggplot(hist_data, aes(x = value)) +
  facet_wrap(~variable, scales = "free_x") +
  geom_histogram()

These histograms underscore several important points. First, they confirm what was indicated earlier: we should model gender, ethnicity, and ceo as categorical variables rather than numeric. Second, we should consider a log transformation of our dependent variable salary (as well as bonus and total_compensation).

Before we turn to this, let’s examine the existing models for issues with outliers, influential observations, residual normality, homoskedasticity, multicollinearity, and non-independence of errors. More information about these diagnostics can be found here.

Outliers

We can assess outliers and several other diagnostics using the library car.

library(car)
fit <- lm_1 # Establish which model to evaluate

outlierTest(fit) # Bonferroni p-value for most extreme obs
##     rstudent                  unadjusted p-value
## 30      12.2 0.000000000000000000000000000000052
## 703     12.2 0.000000000000000000000000000000052
## 924      7.2 0.000000000000904130000000000005699
## 677      5.7 0.000000012012000000000000292349959
## 353      5.4 0.000000094697000000000004171347133
## 14       4.5 0.000008292799999999999612876680488
## 377      4.5 0.000008292799999999999612876680488
## 633      4.5 0.000008292799999999999612876680488
## 885      4.1 0.000041671000000000001334092558647
##                         Bonferonni p
## 30  0.000000000000000000000000000052
## 703 0.000000000000000000000000000052
## 924 0.000000000903229999999999992167
## 677 0.000012000000000000000304010289
## 353 0.000094601999999999999892835723
## 14  0.008284500000000000197175609173
## 377 0.008284500000000000197175609173
## 633 0.008284500000000000197175609173
## 885 0.041628999999999999337418898904

This test shows we have several outliers. We might consider omitting some of these in future models.

qqPlot(fit, main="QQ Plot") #qq plot for studentized resid 

This plot similarly shows we have considerable outliers in the data.

leveragePlots(fit) # leverage plots

The leverage plots illustrate several outliers in modeling salary relative to the covariates. We might also examine influential observations using added-variable plots.

Influential Observations: Added Variable Plots

library(car)
avPlots(fit)

# Cook's D plot
# identify D values > 4/(n-k-1) 
cutoff <- 4/((nrow(execs)-length(fit$coefficients)-2)) 
plot(fit, which=4, cook.levels=cutoff)

Here, we can see several observations that are influential. We might want to consider removing these observations.

removeRows <- function(rowNum, data) {
      newData <- data[-rowNum, , drop = FALSE]
      rownames(newData) <- NULL
      newData
}

execs_example <- removeRows(c(30, 505, 703), execs)
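A quick sanity check of the helper on a toy data frame (self-contained sketch; the function is repeated so the snippet runs on its own):

```r
# Toy sketch: verify removeRows() drops the requested rows
removeRows <- function(rowNum, data) {
  newData <- data[-rowNum, , drop = FALSE]
  rownames(newData) <- NULL
  newData
}
toy <- data.frame(x = 1:5)
removeRows(c(2, 4), toy)$x  # 1 3 5
```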

What does the plot look like now?

lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs_example)

# Cook's D plot
# identify D values > 4/(n-k-1) 
cutoff <- 4/((nrow(execs)-length(lm_1$coefficients)-2)) 
plot(lm_1, which=4, cook.levels=cutoff)

Notice that we now have new outliers that appear. This is typically the case with the removal of outliers. Once we remove the most extreme values, new values will become the outliers for the new distribution.

Nonlinearity

# Evaluate Nonlinearity
# component + residual plot 
crPlots(fit)

Heteroskedasticity

ncvTest applies a test for heteroskedasticity known as the Breusch-Pagan test. There is some good discussion of this test as applied in Stata as well as R.

# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(fit)

# non-constant error variance test
ncvTest(fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 59    Df = 1     p = 0.000000000000016

From this, we would conclude there is heteroskedasticity (which we would like to avoid).

Multicollinearity

Generally, we want to avoid multicollinearity since it can adversely affect model results; in a highly collinear model, you may even see coefficients with the opposite sign. The VIF is one test for this: VIFs greater than 10, or square roots of VIF greater than 2, often indicate a problem. A typical exception is for transformations of the same variable, such as a square or cube term. Compare the VIFs for Model 1 and Model 2, without and with the squared term.
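A sketch on simulated data of why the raw age terms are so collinear, and how orthogonal polynomials from poly() sidestep the issue:

```r
# Toy sketch: raw polynomial terms are nearly collinear;
# poly() returns orthogonal columns by construction
set.seed(1)
age <- rnorm(200, mean = 56, sd = 7)
cor(age, age^2)                  # very close to 1
p <- poly(age, degree = 2)
abs(cor(p[, 1], p[, 2])) < 1e-8  # TRUE: columns are orthogonal
```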

#LM 1
vif(lm_1) # variance inflation factors 
##           GVIF Df GVIF^(1/(2*Df))
## age          1  1               1
## gender       1  1               1
## ethnicity    1  3               1
## ceo          1  1               1
sqrt(vif(lm_1)) > 2 # problem?
##            GVIF    Df GVIF^(1/(2*Df))
## age       FALSE FALSE           FALSE
## gender    FALSE FALSE           FALSE
## ethnicity FALSE FALSE           FALSE
## ceo       FALSE FALSE           FALSE
#LM 2
vif(lm_2) # variance inflation factors 
##           GVIF Df GVIF^(1/(2*Df))
## age       86.3  1             9.3
## age_sq    86.5  1             9.3
## gender     1.0  1             1.0
## ethnicity  1.0  3             1.0
## ceo        1.1  1             1.0
sqrt(vif(lm_2)) > 2 # problem?
##            GVIF    Df GVIF^(1/(2*Df))
## age        TRUE FALSE            TRUE
## age_sq     TRUE FALSE            TRUE
## gender    FALSE FALSE           FALSE
## ethnicity FALSE FALSE           FALSE
## ceo       FALSE FALSE           FALSE

Recalling the prior histograms, some of the issue may be the distribution of salary and age. Let’s transform the data using log transforms.

New Models with Log Transforms

# natural log transform salary, age, age_sq, age_cb
execs <- execs %>% 
  mutate(log_salary = log(salary+1),
         log_age = log(age),
         log_age_sq = log_age^2,
         log_age_cb = log_age^3)

 
# remove observations suggested by prior Cook's test
execs <- removeRows(c(30, 505, 703), execs)
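As an aside (a base-R fact, sketched below), log(x + 1) is also available as log1p(x), which is numerically more accurate for values near zero:

```r
# Toy sketch: log1p(x) equals log(x + 1), with better precision near 0
isTRUE(all.equal(log1p(10), log(10 + 1)))  # TRUE
log1p(0)                                   # 0
```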

# Salary by Age, Gender, Ethnicity, and CEO 
log_lm_1 <- lm(log_salary ~ log_age + gender + ethnicity + ceo, data=execs)

# Adding Age Squared
log_lm_2 <- lm(log_salary ~ log_age + log_age_sq + gender + ethnicity + ceo, data=execs)

# Adding Age Cubed
log_lm_3 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ethnicity + ceo, data=execs)

#Data Table
stargazer(log_lm_1, log_lm_2, log_lm_3, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Multiple Regression of Executive Salary",
          star.cutoffs = c(0.05, 0.01, 0.001))
Multiple Regression of Executive Salary
Dependent variable:
log_salary
(1) (2) (3)
log_age -0.10 11.00 420.00*
(0.20) (7.60) (187.00)
log_age_sq -1.40 -103.00*
(0.94) (46.00)
log_age_cb 8.40*
(3.80)
genderFEMALE -0.03 -0.05 -0.06
(0.11) (0.11) (0.11)
ethnicityAFRICAN-AMERICAN 0.31 0.29 0.27
(0.22) (0.22) (0.22)
ethnicityASIAN 0.003 0.002 -0.01
(0.19) (0.19) (0.19)
ethnicityUNKNOWN -0.21*** -0.20*** -0.20***
(0.05) (0.05) (0.05)
ceoCEO 0.40*** 0.38*** 0.39***
(0.05) (0.05) (0.05)
Constant 6.70*** -17.00 -564.00*
(0.81) (15.00) (252.00)
Observations 996 996 996
R2 0.08 0.08 0.08
Adjusted R2 0.07 0.07 0.08
Residual Std. Error 0.81 (df = 989) 0.81 (df = 988) 0.80 (df = 987)
F Statistic 14.00*** (df = 6; 989) 12.00*** (df = 7; 988) 11.00*** (df = 8; 987)
Note: p<0.05; p<0.01; p<0.001

Compare QQPlots for Salary and Age versus Logs

library(car)

#Salary
qqPlot(execs$salary, main="QQ Plot Salary")

qqPlot(execs$log_salary, main="QQ Plot Log Salary")

#Age
qqPlot(execs$age, main="QQ Plot Age")

qqPlot(execs$log_age, main="QQ Plot Log Age")

We can now even more clearly see the outliers. Let’s do a scatterplot:

ggplot(execs, aes(age, log_salary)) +
  geom_point(alpha=0.2)

Recall from the original log transformation that we added 1 to avoid dropping executives with zero salary. Clearly, those zero-salary observations are the outliers. Let’s correct this by omitting the zero salaries (and any incomplete cases).

Removing Outliers

Since the outliers are all log_salaries less than 0, let’s remove them:

execs <- execs %>% 
  filter(log_salary > 0,
         complete.cases(.))


qqPlot(execs$log_salary, main="QQ Plot Log Salary")

ggplot(execs, aes(age, log_salary)) +
  geom_point(alpha=0.2)

Checking Models Again

How have the regressions changed?

# Salary by Age, Gender, Ethnicity, and CEO 
log_lm_1 <- lm(log_salary ~ log_age + gender + ethnicity + ceo, data=execs)

# Adding Age Squared
log_lm_2 <- lm(log_salary ~ log_age + log_age_sq + gender + ethnicity + ceo, data=execs)

# Adding Age Cubed
log_lm_3 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ethnicity + ceo, data=execs)

#Data Table
stargazer(log_lm_1, log_lm_2, log_lm_3, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Multiple Regression of Executive Salary",
          star.cutoffs = c(0.05, 0.01, 0.001))
Multiple Regression of Executive Salary
Dependent variable:
log_salary
(1) (2) (3)
log_age 0.13 10.00 312.00*
(0.16) (6.00) (149.00)
log_age_sq -1.30 -76.00*
(0.75) (37.00)
log_age_cb 6.20*
(3.00)
genderFEMALE -0.08 -0.09 -0.10
(0.09) (0.09) (0.09)
ethnicityAFRICAN-AMERICAN 0.27 0.26 0.24
(0.17) (0.17) (0.17)
ethnicityASIAN -0.03 -0.03 -0.04
(0.15) (0.15) (0.15)
ethnicityUNKNOWN -0.22*** -0.21*** -0.22***
(0.04) (0.04) (0.04)
ceoCEO 0.34*** 0.33*** 0.33***
(0.04) (0.04) (0.04)
Constant 5.80*** -15.00 -419.00*
(0.64) (12.00) (200.00)
Observations 990 990 990
R2 0.10 0.10 0.11
Adjusted R2 0.10 0.10 0.10
Residual Std. Error 0.64 (df = 983) 0.64 (df = 982) 0.64 (df = 981)
F Statistic 18.00*** (df = 6; 983) 16.00*** (df = 7; 982) 15.00*** (df = 8; 981)
Note: p<0.05; p<0.01; p<0.001

From these new results, it appears that log_lm_2 is the best model. Let’s run some further diagnostics on the models:

# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(log_lm_1)

# non-constant error variance test
ncvTest(log_lm_1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 3    Df = 1     p = 0.083
# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(log_lm_2)

# non-constant error variance test
ncvTest(log_lm_2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.9    Df = 1     p = 0.089
# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(log_lm_3)

# non-constant error variance test
ncvTest(log_lm_3)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.0013    Df = 1     p = 0.97

Rechecking Outliers

Now no model shows significant heteroskedasticity, and model 3 appears to be performing best. Do we still have outliers in each model?

# Bonferonni p-value for most extreme obs (Log Models 1-3)
outlierTest(log_lm_1) 
##     rstudent             unadjusted p-value                Bonferonni p
## 306      -11 0.0000000000000000000000000045 0.0000000000000000000000045
## 489      -11 0.0000000000000000000000000045 0.0000000000000000000000045
## 775      -11 0.0000000000000000000000000045 0.0000000000000000000000045
## 92       -10 0.0000000000000000000000872250 0.0000000000000000000863530
outlierTest(log_lm_2)
##     rstudent             unadjusted p-value                Bonferonni p
## 306      -11 0.0000000000000000000000000041 0.0000000000000000000000041
## 489      -11 0.0000000000000000000000000041 0.0000000000000000000000041
## 775      -11 0.0000000000000000000000000041 0.0000000000000000000000041
## 92       -10 0.0000000000000000000000559110 0.0000000000000000000553520
outlierTest(log_lm_3) 
##     rstudent             unadjusted p-value               Bonferonni p
## 306      -11 0.0000000000000000000000000061 0.000000000000000000000006
## 489      -11 0.0000000000000000000000000061 0.000000000000000000000006
## 775      -11 0.0000000000000000000000000061 0.000000000000000000000006
## 92       -10 0.0000000000000000000000800900 0.000000000000000000079289

Each test reveals the same outliers, which we might want to consider removing. One approach is simply to drop them:

dim(execs)
## [1] 990  15
execs_example <- removeRows(c(306, 489, 775, 92), execs)

Robust Regression

Another approach, however, is robust regression, which downweights outliers rather than discarding them.
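As a toy illustration of the idea (simulated data; rlm() uses Huber weighting by default), a planted gross outlier receives an IWLS weight far below 1 while clean observations keep weight 1:

```r
library(MASS)

# Toy sketch: rlm() assigns small weights to outlying observations
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
y[1] <- 50                 # plant a gross outlier
fit <- rlm(y ~ x)
min(fit$w)                 # the outlier's weight is well below 1
max(fit$w)                 # clean points receive weight 1
```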

library(MASS)


# Salary by Age, Gender, Ethnicity, and CEO 
log_lm_1 <- lm(log_salary ~ log_age + gender + ethnicity + ceo, data=execs)

# Adding Age Squared
log_lm_2 <- lm(log_salary ~ log_age + log_age_sq + gender + ethnicity + ceo, data=execs)

# Adding Age Squared - Ethnicity
log_lm_2_1 <- lm(log_salary ~ log_age + log_age_sq + gender + ceo, data=execs)


# Adding Age Cubed
log_lm_3 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ethnicity + ceo, data=execs)

# Model Without Ethnicity
log_lm_4 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ceo, data=execs)

# Robust Regression Log Model 4
robust_log_lm_4 <- rlm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ceo, data=execs)

robust_log_lm_5 <- rlm(log_salary ~ log_age + log_age_sq + gender + ceo, data=execs)

robust_log_lm_6 <- rlm(log_salary ~ log_age + gender + ceo, data=execs)


#Data Table
stargazer(log_lm_1, log_lm_2, log_lm_2_1, log_lm_3, log_lm_4, robust_log_lm_4, robust_log_lm_5, robust_log_lm_6, header=FALSE, 
          type = 'html', 
          font.size = "footnotesize", 
          digits = 2,
          title = "Multiple Regression of Executive Salary",
          star.cutoffs = c(0.05, 0.01, 0.001))
Multiple Regression of Executive Salary
Dependent variable:
log_salary
OLS robust
linear
(1) (2) (3) (4) (5) (6) (7) (8)
log_age 0.13 10.00 14.00* 312.00* 317.00* 186.00 15.00*** 0.42***
(0.16) (6.00) (6.00) (149.00) (150.00) (105.00) (4.20) (0.11)
log_age_sq -1.30 -1.70* -76.00* -77.00* -44.00 -1.80***
(0.75) (0.75) (37.00) (37.00) (26.00) (0.53)
log_age_cb 6.20* 6.20* 3.50
(3.00) (3.10) (2.20)
genderFEMALE -0.08 -0.09 -0.07 -0.10 -0.08 -0.10 -0.10 -0.08
(0.09) (0.09) (0.09) (0.09) (0.09) (0.06) (0.06) (0.06)
ethnicityAFRICAN-AMERICAN 0.27 0.26 0.24
(0.17) (0.17) (0.17)
ethnicityASIAN -0.03 -0.03 -0.04
(0.15) (0.15) (0.15)
ethnicityUNKNOWN -0.22*** -0.21*** -0.22***
(0.04) (0.04) (0.04)
ceoCEO 0.34*** 0.33*** 0.36*** 0.33*** 0.36*** 0.37*** 0.37*** 0.38***
(0.04) (0.04) (0.04) (0.04) (0.04) (0.03) (0.03) (0.03)
Constant 5.80*** -15.00 -22.00 -419.00* -429.00* -254.00 -24.00** 4.60***
(0.64) (12.00) (12.00) (200.00) (202.00) (141.00) (8.50) (0.45)
Observations 990 990 990 990 990 990 990 990
R2 0.10 0.10 0.08 0.11 0.08
Adjusted R2 0.10 0.10 0.07 0.10 0.08
Residual Std. Error 0.64 (df = 983) 0.64 (df = 982) 0.65 (df = 985) 0.64 (df = 981) 0.65 (df = 984) 0.41 (df = 984) 0.42 (df = 985) 0.43 (df = 986)
F Statistic 18.00*** (df = 6; 983) 16.00*** (df = 7; 982) 21.00*** (df = 4; 985) 15.00*** (df = 8; 981) 18.00*** (df = 5; 984)
Note: p<0.05; p<0.01; p<0.001